{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 调度系统"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在360内部我们使用[XLearning](https://github.com/Qihoo360/XLearning)完成TensorNet程序的提交和调度。不过TensorNet不局限于使用XLearning调度，任何可以管理MPI任务的调度系统都可以。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 作业提交"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "提交作业的时候除了将自己模型的Python文件提交到集群外，还需要将依赖的包一并打包上去。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "$ FILES=\"tensornet.tar.gz,wide-deep.py,job_main.sh\"\n",
    "$ MPI_QUEUE=\"xxxx\"\n",
    "$ CMD=\"sh job_main.sh\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**job_main.sh**的内容如下"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "$ cat job_main.sh\n",
    "\n",
    "# 解压tensornet包，准备环境\n",
    "function prepare_env() {\n",
    "    tar -zxvf lib.tar.gz > /dev/null\n",
    "\n",
    "    (\n",
    "        mkdir -p tn\n",
    "        cd tn\n",
    "        mv ../tensornet.tar.gz .\n",
    "        tar -zxvf tensornet.tar.gz > /dev/null\n",
    "    )\n",
    "\n",
    "    return 0\n",
    "}\n",
    "\n",
    "function start_job()\n",
    "{\n",
    "    export PYTHONPATH=\"$(pwd)/tn:${PYTHONPATH}\"\n",
    "    export LD_LIBRARY_PATH=\"$(pwd)/lib:${JAVA_HOME}/jre/lib/amd64/server:${HADOOP_HOME}/lib/native\"\n",
    "\n",
    "    (\n",
    "        set -o xtrace\n",
    "\n",
    "        ${PYTHON3} wide-deep.py                         \\\n",
    "            --data_dir=\"${YARN_HDFS}${TF_RECORD_PATH}\"  \\\n",
    "            --match_pattern='part*'\n",
    "    )\n",
    "\n",
    "    return 0\n",
    "}\n",
    "\n",
    "prepare_env\n",
    "start_job"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "执行提交运行。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "$ mpi-submit \\\n",
    "   --app-type \"tensornet\"                \\\n",
    "   --app-name \"wide-deep\"                \\\n",
    "   --worker-num 50                       \\\n",
    "   --worker-memory 50G                   \\\n",
    "   --worker-cores 6                      \\\n",
    "   --files \"${FILES}\"                    \\\n",
    "   --queue \"${MPI_QUEUE}\"                \\\n",
    "   --launch-cmd \"${CMD}\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## NOTICE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在我们使用XLearning的时候没有使用它可以指定`ps`节点的功能，只需要指定`worker`即可，如果使用其它调度工具应该不需要操心这个。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
